Introduction

Data

The project focuses on exploratory data analysis of the vast NYC Airbnb data available from guests and hosts since 2008. The dataset describes the listing activity and metrics in NYC for 2019.For more details regarding the package, please refer to the Dataset.

Goal

The goal is to create a smart pricing model from this dataset. It will help customers understand the influencing factors affecting the price appropriately.Hosts too can base their pricing for their listings based on this model.

How does the the number of reviews, stay duration, availability and neighburhood vary in signficance to the price of the listing.

Modeling

My target variable is Price and I plan to model it with linear regression and compare the effects of different predictor variables on Price. Number or reviews, availability, duration of stay and the area would be strong predictors.

Import & Check Data

Selecting Variables

Review/month - All NA assigned to 0. These are rows of listings that havent received any review.

# Dropping the above mentioned two columns

nyc <- nyc %>% 
  select(-host_id, -last_review)

# Replacing all NA's in reviews/month to 0. Corresponds to all listings with 0 reviews

nyc$reviews_per_month[is.na(nyc$reviews_per_month)] <- 0

nyc %>% summary
##        id                                         name      
##  Min.   :    2539   Hillside Hotel                  :   18  
##  1st Qu.: 9471945   Home away from home             :   17  
##  Median :19677284                                   :   16  
##  Mean   :19017143   New york Multi-unit building    :   16  
##  3rd Qu.:29152178   Brooklyn Apartment              :   12  
##  Max.   :36487245   Loft Suite @ The Box House Hotel:   11  
##                     (Other)                         :48805  
##         host_name        neighbourhood_group            neighbourhood  
##  Michael     :  417   Bronx        : 1091    Williamsburg      : 3920  
##  David       :  403   Brooklyn     :20104    Bedford-Stuyvesant: 3714  
##  Sonder (NYC):  327   Manhattan    :21661    Harlem            : 2658  
##  John        :  294   Queens       : 5666    Bushwick          : 2465  
##  Alex        :  279   Staten Island:  373    Upper West Side   : 1971  
##  Blueground  :  232                          Hell's Kitchen    : 1958  
##  (Other)     :46943                          (Other)           :32209  
##     latitude       longitude                room_type    
##  Min.   :40.50   Min.   :-74.24   Entire home/apt:25409  
##  1st Qu.:40.69   1st Qu.:-73.98   Private room   :22326  
##  Median :40.72   Median :-73.96   Shared room    : 1160  
##  Mean   :40.73   Mean   :-73.95                          
##  3rd Qu.:40.76   3rd Qu.:-73.94                          
##  Max.   :40.91   Max.   :-73.71                          
##                                                          
##      price         minimum_nights    number_of_reviews reviews_per_month
##  Min.   :    0.0   Min.   :   1.00   Min.   :  0.00    Min.   : 0.000   
##  1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00    1st Qu.: 0.040   
##  Median :  106.0   Median :   3.00   Median :  5.00    Median : 0.370   
##  Mean   :  152.7   Mean   :   7.03   Mean   : 23.27    Mean   : 1.091   
##  3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00    3rd Qu.: 1.580   
##  Max.   :10000.0   Max.   :1250.00   Max.   :629.00    Max.   :58.500   
##                                                                         
##  calculated_host_listings_count availability_365
##  Min.   :  1.000                Min.   :  0.0   
##  1st Qu.:  1.000                1st Qu.:  0.0   
##  Median :  1.000                Median : 45.0   
##  Mean   :  7.144                Mean   :112.8   
##  3rd Qu.:  2.000                3rd Qu.:227.0   
##  Max.   :327.000                Max.   :365.0   
## 

Inspect & Transform : Individual variables

Categorical Variables

# Neighborhood, Neighborhood_group and room-type can be viewed in a tabular format


#Neighborhood_Group Inspect

nyc$neighbourhood_group %>% table 
## .
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1091         20104         21661          5666           373
# Categorizing by Room type

nyc$room_type %>% table
## .
## Entire home/apt    Private room     Shared room 
##           25409           22326            1160
#Neighborhood_Group Inspect
summary(nyc$neighbourhood) %>% head
##       Williamsburg Bedford-Stuyvesant             Harlem 
##               3920               3714               2658 
##           Bushwick    Upper West Side     Hell's Kitchen 
##               2465               1971               1958
# Many hosts have multiple listings
nyc$calculated_host_listings_count %>% unique %>% length
## [1] 47

We see that Brooklyn and Manhattan has the most listings , with Williamsburg, Bedford and Harlem being the top areas. Private rooms and Entire homes/ apartments take up the dominant house types in New York However, Staten Island and Bronx are underrepresented due to comparitevely less population than the other boroughs.

Continious Variables

# Price Inspect

hist(nyc$price, main = "Price Distribution", xlim = c(5,1000))

# Availability Inspect

hist(nyc$availability_365, main = "Availibility Distribution")

# No. of reviews Inspect

hist(nyc$number_of_reviews, main = "Reviews Distribution")

# Min. no. of nights Inspect

hist(nyc$minimum_nights, main = "Min number of nights")

The availibility histogram is broadly spread and price has a mean of $152, with some outliers towards the higher end.

Number of reviews also shows a skewed distribution. Minimum number of nights will be transformed into a binary variable, 3 nights and more of stay would be 1. Less than 3 nights would be 0. Availability will also be split into a categorical variable indicating popularity, more avilable the listing, less popular.

A log transformation and removal of outliers would look more better for our analysis.

  1. Did a interquartile function on price and number of reviews to remove outliers above the 3rd quartile
  2. Price shows a better result and will need to be logged.
# Removing outliers for Price variable


outlier_cutoff <- quantile(nyc$price, 0.75) + 1.5 * IQR(nyc$price)
index_outlier_ROT <- which(nyc$price > outlier_cutoff)
nyc <- nyc[-index_outlier_ROT, ]

plot(nyc$price)

hist(nyc$price)

  1. Number of reviews still look skewed and need to be logged
#Removing outliers for Number of reviews variable
outlier_cutoff2 <- quantile(nyc$number_of_reviews, 0.75) + 1.5 * IQR(nyc$number_of_reviews)
index_outlier_ROT <- which(nyc$number_of_reviews > outlier_cutoff2)
nyc <- nyc[-index_outlier_ROT, ]

plot(nyc$number_of_reviews)

hist(nyc$number_of_reviews)

#Fitting the skewed variables by log function
nyc$logprice = log1p(nyc$price)
nyc$logreviews = log1p(nyc$number_of_reviews)
nyc$logavailability = log1p(nyc$availability_365)

Bar plot of neighbourhood group split by the type of home

Our research delves into most visited neighbourhood boroughs to give a perspective on how our smart pricing would be beneficial for a larger customer base.

Private rooms and entire homes/apartments seem to be frequently booked home type. Boorklyn & Manhattan would be the focus of our model

# Plotting neighborhood group wise listings split by apartment type
ggplot(data = nyc) +
    geom_bar(mapping = aes(x=room_type, fill=neighbourhood_group),position="dodge")

Doing the required transformations on these continious variables

Clustering the continious variable into a categorical variable by splitting the availability to show popularity index. Comparing the popularity index with the price.

Converting minimum number of nights into a categorical variable, with more than 3 nights shown as 1 and remaining as 0.

# Determining the popularity index of all listings - analyzing few available days or many available days
nyc <- nyc %>% mutate(
  Rarefind = case_when(
    availability_365 < 75 ~ 'Popular', 
    availability_365 >= 75 & availability_365 < 250 ~ 'Good',
    availability_365 >= 250 ~ 'Okay')
)

ggplot(data = nyc) +
    geom_bar(mapping = aes(x=Rarefind, fill=mean(price)),position="dodge")

# Finding mean price for each of these categories 
nyc %>% select(Rarefind, price) %>% filter(Rarefind == "Popular") %>% summarise(mean = mean(price))
##       mean
## 1 116.8352
nyc %>% select(Rarefind, price) %>% filter(Rarefind == "Okay") %>% summarise(mean = mean(price))
##       mean
## 1 128.2417
nyc %>% select(Rarefind, price) %>% filter(Rarefind == "Good") %>% summarise(mean = mean(price))
##       mean
## 1 123.1372
# Creating a binary version of Min number of nights ( Splitting into more than 3 Nights and less than 3 Nights)
nyc$staytime <- ifelse(nyc$minimum_nights > median(nyc$minimum_nights), 1, 0)

nyc$staytime %>% table
## .
##     0     1 
## 25586 14485

The bar plot shows the split of listings in the popularity index we created. There seems to be most of the listings that are booked throughout the year and fall into the Popular segment(~24000). Good and Okay has around 8000 listings each. Checking mean price for each cluster we created : Popular - 116.83 Okay - 128.24 Good - 123.13 This confirms the trend to show the most booked ones are priced comparitively less than the ones that are available more.

Price & Dependent Variables

#Price Vs Min number of nights


g3 <- nyc %>% 
  ggplot(aes(x=logprice, fill=factor(nyc$staytime))) +
  geom_density(alpha=0.5) +
  ggtitle('Grouped density plot: Staytime x Price') +
  theme(legend.position="bottom")
g3

# Price Vs Availability

g4 <- nyc %>% 
  ggplot(aes(x=Rarefind, y=logprice, fill=Rarefind)) +
  geom_boxplot() + 
  ggtitle('Boxplot: Price x Availability') +
  theme(legend.position="bottom")
g4

a1 <- ggplot(data = nyc, aes(x = logavailability, y = logprice)) + geom_point(stat = "identity")
a1

Price relation to Min number of nights

  • We see the price of the listings where customers stay more than 3 nights have a significantly shorter density curve than the listings where customers stay for less than 3 days. It confirms that customers opt for listings with slightly lesser average price when they need to stay for a long duration.

  • Verifying the above shown relation between popularity (Rarefind variable) and price through a box plot

Price relation to Availability

The popular listings that are least available have the lowest prices. The listings that are most available are the most expensive.

Price relation to No. of reviews

The distribution of the number of reviews is highly skewed however we see it. This is because there are lot of listings which only get few reviews and even more listings which get lot of reviews. With the reviews Vs availability scatter plot we see no evidence to show any kind of relationship between them. Correlation shows 0.10

# Price Vs No. of reviews


g1 <- nyc %>% ggplot(aes(x=logreviews, y = logprice)) +
                       geom_point(alpha =0.2) + 
                       geom_smooth(method = 'lm')+
                       ggtitle('Scatterplot: number of reviews x Price')
g1

g2 <- nyc %>% ggplot(aes(x=logreviews, y = logavailability)) +
                       geom_point(alpha =0.2) + 
                       geom_smooth(method = 'lm')+
                       ggtitle('Scatterplot: number of reviews x Availibility')
g2

print(cor(nyc$logreviews, nyc$availability_365))
## [1] 0.1048355

Testing for collinearity

print(cor(nyc$logprice, nyc$logreviews))
## [1] -0.02116012
print(cor(nyc$logprice, nyc$logavailability))
## [1] 0.05030661
print(cor(nyc$logprice, nyc$minimum_nights))
## [1] 0.02272508

Neither result showed a value of more than 0.8, so that confirms there is no multi collinearity among these continious variables The corrlation coefficient is low and the plot doesnt show a strong pattern. So we shouldnt be concerned with multicollinearity Non linearity of this plot can be fixed with transforming the individual variables to have a normal distribution

Distribution of the new variable ‘Rarefind’

nyc %>% 
  ggplot(aes(longitude, latitude)) + 
  geom_hex() + 
  scale_fill_gradient(low = 'yellow', high = 'purple', breaks = c(200, 800)) + 
  labs(x = 'Longitude', y = 'Latitude') + 
  facet_wrap(~ Rarefind) + 
  theme(legend.position = 'bottom')

It seems that the the popular listings are confined in a certain part of the NY map. It could be a combination of more listings and expensive rooms.

nyc %>% 
   
  arrange(logprice) %>% 
  ggplot() +
    geom_point(aes(x = longitude, y = latitude, color = logprice)) +
    # scale_color_distiller(palette = "Spectral") +
    scale_colour_gradient(low = "gray87", high = "red") +
    labs(title = "Price heat map of Airbnb apartments",
          sep = "",
         color = "Price") +
    theme_minimal()

We create a heatmap to detect the price and it shows the most expensive homes that coincides with the above visual.

# Displaying count of listings comparitevely cheaper and more frequently avaialable in each neighbourhood
nyc %>% select(name, Rarefind, price, neighbourhood_group, neighbourhood) %>%
  filter(Rarefind == 'Good', price < mean(price)) %>%
  .$neighbourhood_group %>% table %>% sort
## .
## Staten Island         Bronx        Queens     Manhattan      Brooklyn 
##            69           248          1018          1244          2028

To further exploration of ideal listings users can browse, I have shown a glimpse of the bouroughs that has listings that could raise interest Price factor - Cheap Availability - Easy to book

Modeling

Linear Regression Model

lm1 = lm(formula = logprice ~  nyc$staytime + nyc$latitude + nyc$longitude + logreviews + logavailability + nyc$room_type,
               data = nyc)
summary(lm1)
## 
## Call:
## lm(formula = logprice ~ nyc$staytime + nyc$latitude + nyc$longitude + 
##     logreviews + logavailability + nyc$room_type, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1281 -0.2708 -0.0056  0.2527  1.9315 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.624e+02  3.838e+00  -68.36   <2e-16 ***
## nyc$staytime              -1.405e-01  4.369e-03  -32.15   <2e-16 ***
## nyc$latitude               1.126e+00  3.676e-02   30.63   <2e-16 ***
## nyc$longitude             -2.996e+00  4.574e-02  -65.49   <2e-16 ***
## logreviews                -2.999e-02  1.674e-03  -17.92   <2e-16 ***
## logavailability            2.974e-02  8.470e-04   35.12   <2e-16 ***
## nyc$room_typePrivate room -7.196e-01  4.209e-03 -170.99   <2e-16 ***
## nyc$room_typeShared room  -1.109e+00  1.290e-02  -85.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4036 on 40063 degrees of freedom
## Multiple R-squared:  0.5194, Adjusted R-squared:  0.5193 
## F-statistic:  6185 on 7 and 40063 DF,  p-value: < 2.2e-16

All variables are significant. Median residual error is -0.005 and R^2 = 0.519 means the model explains about 52% of the variance of target variable, either of which is not good. Residual standard error is at 40% which means the probability of the model having an error of 40%

Interpreting Coefficients

lm1  %>% coefficients %>% exp %>% round(3)
##               (Intercept)              nyc$staytime 
##                     0.000                     0.869 
##              nyc$latitude             nyc$longitude 
##                     3.084                     0.050 
##                logreviews           logavailability 
##                     0.970                     1.030 
## nyc$room_typePrivate room  nyc$room_typeShared room 
##                     0.487                     0.330

Negative Correlation

Staytime The odds of price decrease to a listing when a customer stays more than 3 days (Staytime is a binary variable ,>3 days & <3 days minimum stay) is approximately 13%

Number of reviews The price tends to decrease approximately ~3% for as review increases for the listings

Room type

nyc %>% select(room_type, logprice) %>% filter(room_type == "Entire home/apt") %>% summarise(mean = mean(logprice))
##       mean
## 1 5.025237
nyc %>% select(room_type, logprice) %>% filter(room_type == "Private room") %>% summarise(mean = mean(logprice))
##       mean
## 1 4.274782
nyc %>% select(room_type, logprice) %>% filter(room_type == "Shared room") %>% summarise(mean = mean(logprice))
##       mean
## 1 3.924415

What is shown here is that how much significant the room type is to the price. This coefficient states that price of shared room and private room is comparetively lower than an entire home/apt. Private room is about 52% lower in cost than an entire home. Shared room shows to be 67% cheaper in price than an entire home.

Positive Correlation

Availability Price increases by ~3% when the listing is comparitively more available

Evaluating Predictive Performance

Data Splitting

Training set will be 80% of the original dataset. Listings with price = 0 would not be considered as it would help make our predictive model significantly stronger.

set.seed(200)

split = sample.split(nyc$price,SplitRatio = 0.8)
nyc_train = subset(nyc, split == TRUE)
nyc_test = subset(nyc, split == FALSE)

Now that we have both train and test dataset , we can train a model on the training set and make predictions on the training and test datasets. For linear regression ,we just need the predicted value of ‘Y’ = Price

#Fit model into training data

lm_train <- lm(formula = logprice ~  staytime + logreviews + logavailability + room_type,
               data = nyc_train)
summary(lm_train)
## 
## Call:
## lm(formula = logprice ~ staytime + logreviews + logavailability + 
##     room_type, data = nyc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0716 -0.2969 -0.0065  0.2813  1.8147 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.0716384  0.0056289  901.00   <2e-16 ***
## staytime              -0.1103303  0.0051581  -21.39   <2e-16 ***
## logreviews            -0.0340566  0.0019782  -17.22   <2e-16 ***
## logavailability        0.0223741  0.0009948   22.49   <2e-16 ***
## room_typePrivate room -0.7704830  0.0049041 -157.11   <2e-16 ***
## room_typeShared room  -1.1513935  0.0152242  -75.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4274 on 32055 degrees of freedom
## Multiple R-squared:  0.4609, Adjusted R-squared:  0.4609 
## F-statistic:  5482 on 5 and 32055 DF,  p-value: < 2.2e-16

Median residual standard error is almost the same(-0.04) and R^2 is around 40% is due to all these dependent variables. What we can do is cluster neighbourhoods that we wish to focus and try to derive the most positive impactful predictors. It shows a large residual error at 53% that needs to be looked at.

Predicting the prices using the above model

Apply on Train data

#Predicting the train set Results

y_pred = predict(lm_train, newdata = nyc_train)
y_pred %>% head
##        1        2        4        5        6        7 
## 4.354803 5.072694 4.882890 4.057595 4.868995 4.847824
predicted_price <- c(y_pred %>% exp %>% round(2))
head(predicted_price)
##      1      2      4      5      6      7 
##  77.85 159.60 132.01  57.84 130.19 127.46

Showing the predicted price estimate for listings in the train dataset. This can be used by our hosts with individual properties as well as multiple properties to make a price list and help improve their business through strategic decsion-making

Applying on Test Data

#Fit model into test data

lm_test <- lm(formula = logprice ~  staytime + logreviews + logavailability + room_type,
               data = nyc_test)
summary(lm_test)
## 
## Call:
## lm(formula = logprice ~ staytime + logreviews + logavailability + 
##     room_type, data = nyc_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2097 -0.2922 -0.0009  0.2842  1.6287 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.067780   0.011270 449.652   <2e-16 ***
## staytime              -0.118147   0.010240 -11.538   <2e-16 ***
## logreviews            -0.035113   0.003977  -8.828   <2e-16 ***
## logavailability        0.023881   0.001994  11.977   <2e-16 ***
## room_typePrivate room -0.768039   0.009782 -78.515   <2e-16 ***
## room_typeShared room  -1.149108   0.030735 -37.388   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4273 on 8004 degrees of freedom
## Multiple R-squared:  0.4613, Adjusted R-squared:  0.461 
## F-statistic:  1371 on 5 and 8004 DF,  p-value: < 2.2e-16

The outcome of the test data seems very similar to the training data. Residual error and R^2 increases to 42% again. It kind of concludes that most variation in price is still not explained by the model. Residual standard error = 52%

#Predicting the test set Results

y_pred2 = predict(lm_test, newdata = nyc_test)
y_pred2 %>% head
##        3        9       14       15       22       25 
## 4.440702 5.081491 4.909880 4.930973 4.977054 4.300834
predicted_price2 <- c(y_pred2 %>% exp %>% round(2))
head(predicted_price2)
##      3      9     14     15     22     25 
##  84.83 161.01 135.62 138.51 145.05  73.76

Test data shows rows 7 to 30 and their predicted price estimate.

Improving Model Performance by Clustering Neighbourhoods:

  • Select the top visited and booked neighbourhoods and analyze the price Vs predictor significance to obtain better accuracy
  • Construct a base line price estimate for lsits with more availability to increase chances of a attracting a customer.
nyc_main <- nyc %>% 
  select(logprice,host_name, room_type, neighbourhood_group, neighbourhood, logavailability, logreviews, staytime) %>% 
  filter(neighbourhood %in% c("Williamsburg",  "Bedford-Stuyvesant", "Harlem",  "Bushwick", "Upper West Side"))


# Modeling with linear regression

lm3 = lm(formula = logprice ~ neighbourhood + staytime +  logreviews + logavailability + room_type,
               data = nyc_main)
summary(lm3)
## 
## Call:
## lm(formula = logprice ~ neighbourhood + staytime + logreviews + 
##     logavailability + room_type, data = nyc_main)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0682 -0.2417 -0.0137  0.2221  1.7032 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   4.822065   0.010452  461.363  < 2e-16 ***
## neighbourhoodBushwick        -0.032278   0.010501   -3.074  0.00212 ** 
## neighbourhoodHarlem           0.143532   0.010439   13.750  < 2e-16 ***
## neighbourhoodUpper West Side  0.373718   0.011618   32.168  < 2e-16 ***
## neighbourhoodWilliamsburg     0.273803   0.009426   29.049  < 2e-16 ***
## staytime                     -0.145837   0.007194  -20.272  < 2e-16 ***
## logreviews                   -0.015729   0.002850   -5.520 3.47e-08 ***
## logavailability               0.032528   0.001445   22.505  < 2e-16 ***
## room_typePrivate room        -0.729661   0.007089 -102.925  < 2e-16 ***
## room_typeShared room         -1.092003   0.024561  -44.460  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3705 on 12278 degrees of freedom
## Multiple R-squared:  0.5644, Adjusted R-squared:  0.5641 
## F-statistic:  1768 on 9 and 12278 DF,  p-value: < 2.2e-16

Median residual error is the same - -0.013 r^2 increases to 56% which is good and residual standard error is the lowest yet at 37%

Conducting similar linear regression on nyc_main dataset that comprises of only the listings from the highest customer base neighbourhoods ### Applying on Train & Test data

Splitting Data

# Fitting model into train dataset
library(caTools)

set.seed(200)

split = sample.split(nyc_main$logprice,SplitRatio = 0.8)
nyc_tr = subset(nyc_main, split == TRUE)
nyc_ts = subset(nyc_main, split == FALSE)

lm_tr <- lm(formula = logprice ~  staytime + logreviews + logavailability + room_type,
               data = nyc_tr)
summary(lm_tr)
## 
## Call:
## lm(formula = logprice ~ staytime + logreviews + logavailability + 
##     room_type, data = nyc_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1735 -0.2660 -0.0145  0.2498  1.8898 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.021388   0.009795 512.622  < 2e-16 ***
## staytime              -0.133899   0.008596 -15.577  < 2e-16 ***
## logreviews            -0.019118   0.003393  -5.635  1.8e-08 ***
## logavailability        0.026632   0.001715  15.533  < 2e-16 ***
## room_typePrivate room -0.793755   0.008295 -95.685  < 2e-16 ***
## room_typeShared room  -1.168464   0.028634 -40.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3968 on 9832 degrees of freedom
## Multiple R-squared:  0.5026, Adjusted R-squared:  0.5023 
## F-statistic:  1987 on 5 and 9832 DF,  p-value: < 2.2e-16
#Predicting the train set Results

y_pred3 = predict(lm_tr, newdata = nyc_tr)
y_pred3 %>% head
##        1        2        3        4        5        7 
## 4.384834 4.018945 4.863052 4.977368 4.250935 4.304228
predicted_price3 <- c(y_pred3 %>% exp %>% round(2))
head(predicted_price3)
##      1      2      3      4      5      7 
##  80.22  55.64 129.42 145.09  70.17  74.01

Median Residual Error = -0.01 R^2 decreased to 50% for the train dataset.Residual Standard error has dropped to 39% which is good.

Displaying the predicted price for the listings in the dataset.

#Fit model into test data

lm_ts <- lm(formula = logprice ~  staytime + logreviews + logavailability + room_type,
               data = nyc_ts)
summary(lm_ts)
## 
## Call:
## lm(formula = logprice ~ staytime + logreviews + logavailability + 
##     room_type, data = nyc_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9668 -0.2673 -0.0110  0.2424  1.7034 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.978526   0.019606 253.929  < 2e-16 ***
## staytime              -0.104241   0.017385  -5.996 2.32e-09 ***
## logreviews            -0.018687   0.006988  -2.674  0.00754 ** 
## logavailability        0.027521   0.003491   7.884 4.73e-15 ***
## room_typePrivate room -0.765381   0.016690 -45.859  < 2e-16 ***
## room_typeShared room  -1.157843   0.066435 -17.428  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4009 on 2444 degrees of freedom
## Multiple R-squared:  0.4811, Adjusted R-squared:   0.48 
## F-statistic: 453.2 on 5 and 2444 DF,  p-value: < 2.2e-16
y_pred4 = predict(lm_ts, newdata = nyc_ts)
y_pred4 %>% head
##        6       11       17       20       37       46 
## 4.932090 4.315148 5.076349 4.240914 4.902244 4.907579
predicted_price4 <- c(y_pred4 %>% exp %>% round(2))
head(predicted_price4)
##      6     11     17     20     37     46 
## 138.67  74.82 160.19  69.47 134.59 135.31

There is not an overall difference in the outcome of the test data that suggests a high possibility of explaining the significance of the predictors as compared to the analysis done before. Residual standard error = 40% and R^2 = 48%

regression_results <- tibble(
  obs = nyc_tr$logprice,
  pred = y_pred3,
  diff = pred - obs,
  abs_diff = abs(pred - obs),
  neighbourhood = nyc_tr$neighbourhood,
  name = nyc_tr$host_name,
  group = nyc_tr$neighbourhood_group,
  type = nyc_tr$room_type
  
)

regression_plot <- regression_results %>% 
  ggplot(aes(obs, pred)) +
geom_point(alpha = 0.1, aes(text = paste("Name:", name, "\nGroup:", group, "\nType:", type,
                                           "\nPrice diff = ", diff))) +
  theme() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Observed vs predicted",
          subtitle = "Linear regression model") + 
  geom_abline(slope = 1, intercept = 0, color = "blue", linetype = 2)  +
  facet_wrap(~neighbourhood)

ggplotly(regression_plot)
  • All the 5 neighbourhood in focus have been displayed with observed and predicted prices through an interactive dashbaord
  • Detailed description of the property that can be accessed by customers and hosts alike.

Conclusion